Is Matching Pursuit Solving Convex Problems? 
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Abstract — Sparse recovery (sr) has emerged as a very powerful tool for signal processing, data mining and pattern recognition. 
To solve SR, many efficient matching pursuit (mp) algorithms have been proposed. However, it is still not clear whether SR can be 
formulated as a convex problem that is solvable using mp algorithms. To answer this, in this paper, a novel convex relaxation model is 
presented, which is solved by a general matching pursuit (gmp) algorithm under the convex programming framework, gmp has several 
advantages over existing methods. At first, it solves a convex problem and guarantees to converge to a optimum. In addition, with 
£i-regularization, it can recover any fc-sparse signals if the restricted isometry constant o-fe < 0.307 - u, where u can be arbitrarily 
close to 0. Finally when dealing with a batch of signals, the computation burden can be much reduced using a batch-mode gmp. 
Comprehensive numerical experiments show that gmp achieves better performance than other methods in terms of sparse recovery 
ability and efficiency We also apply gmp to face recognition tasks on two well-known face databases, namely Extended using and AR. 
Experimental results demonstrate that gmp can achieve better recognition performance than the considered state-of-the-art methods 
within acceptable time. Particularly the batch-mode gmp can be up to 500 times faster than the considered ii methods. 

Index Terms — Sparse recovery compressive sensing, matching pursuit, convex programming, face recognition. 



1 Introduction 

Sparse recovery (SR), also known as sparse representation or 
sparse reconstruction, has been widely used in many applica- 
tions, such as signal processing and machine learning H], ||2|, 
|[3l . Typically, SR is a fundamental element of the compressive 
sensing [I], lH. Besides, SR has been successfully applied 
to image processing tasks, such as image restoration ||5], lH, 
image super-resolution ||7] and so on. In machine learning area, 
SR has been widely applied in robust face recognition ||2l, 
subspace clustering iH, 191, dictionary learning llTOl . llTTI and 
feature learning ||3], lfT2l . 

From the compressive sensing theory lH], SR guarantees to 
recover a sparse signal from underdetermined linear systems 
under some restricted conditions. Consider a linear model b = 
Ax + e, where A = [ai, ...a™] G M"^™(n <C m) is a design 
matrix or dictionary, aj e M" denotes the jth atom and e £ 
R" is the additive noise, SR aims to recover a fc-sparse signal 
X from b by solving the following £o-norm constrained inverse 
problem: 



min ||b — Ax|| s.t. ||x||o < k, 



(1) 



where || • || and || • ||o denote the £2 and £0 norm, respectively. 
Here x is called as the sparse representation of b. However, 
to successfully recover the sparse signals from the the mea- 
surement, restricted conditions are necessary. The restricted 
isometry property (RIP) is one of the important conditions Q. 
Specifically, for an n x to design matrix A and an integer 
k G [1,™], the k-restricted isometry constant dk G [Oj 1) is 
the smallest number such that. 



(1 



4)||xf < IIAxf < + 



(2) 



for all X with ||x||o < k. Studies have shown that it is 
possible to recover x with n = 0(fclog(m)) nonadaptive 
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measurements by solving ^ IT], [j?], lfT3l. However, be- 
cause of the £Q-norm constraint, problem ([l) is NP-hard to 
solve lfT4l . Therefore, practitioners usually solve the its bi- 
convex relaxations ifTsl . lfT6l . 



X 1 



s.t. b-Axei3f, 



(3) 



where is determined by the noise structure with p E 
{0, 2, 00} IITtI . Specifically, = {0} represents the noiseless 
case, = : \\^\\ < e} denotes the ^2-norm noise, and 
B°° = {^: IIA^IIloo < A} denotes the Dantzig selector IfTSl , 
IITtI . respectively. The recovery conditions of f i-relaxation 
have been widely studied by many researchers |[T], |[T3l . 
[:17|. For example, [l?] shows that if tTfe satisfies ak < 0.307, 
the solution to (|3]l can recover the fc-sparse signals exactly. 
Let A be a regularization parameter, in practice, the following 
£1 -regularized LASSO problem is also widely studied |[T9l : 



A||x| 



Axl 



(4) 



The major issue of £1 -relaxations is the unbearable compu- 
tational cost II20I , Il2n . II22I . Although many fast algorithms 
have been proposed recently, they are still expensive due to 
the frequent calculations of Ax and A^^ 1231 . In real-world 
applications, to make it solvable with acceptable cost, the 
number of n has to be reduced using random projections 
or PC A 121, l2n . However, with dimension reduction, the 
recognition performance may decrease llSTI . Typically, two 
recent papers argued that the simple least square regression 
(denote by L2) or the i'2-regularized least square regression 
(denote by L2-L2) can achieve better recognition rates and 
faster recognition speed than SR on face recognition tasks 
without dimension reduction 11211 . Il22l . Furthermore, since 
the solutions of L2 and L2-L2 are not sparse but provide 
comparable performance, the authors argued that the enforcing 
of sparsity constraints may not help improve the recognition 
performance but bring heavy computational cost ||201 . 11211 . 

Regardless of the arguments on the effectiveness of sparsity, 
the heavy computational cost of solving £1 -relaxations is 
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widely acknowledged Q, 1201 . II2TII . Il22l . Accordingly, how to 
improve the efficiency of SR is an urgent issue. Besides £1 con- 
vex relaxations, matching pursuit (MP) includes another family 
of SR methods |l24l, JSS], 1231. Among them, the orthogonal 
matching pursuit (OMP) or the fully corrective forward greedy 
selection (FCFGS) is the most well known l26l, JStI. The 
greedy idea of OMP has also been extended to solve more 
general sparse problems with slight modifications ||28l . ||29l . 
Unfortunately, these OMP type methods, which only include 
one atom in each iteration, are very expensive for computation 
especially when the sparsity k is large. 

To improve the efficiency, many OMP variants have been 
proposed, including the regularized orthogonal matching pur- 
suit (ROMP) l30l , compressive sampling matching pursuit 
(CoSaMP) l24l, subspace pursuit (SP) ES, accelerated it- 
erative hard thresholding (AIHT) El, EH, El, orthogonal 
matching pursuit with replacement (OMPR) l23l and so on. For 
these variants, the computational cost can be much reduced 
since they only calculate A^^ several times ll23l . Therefore, 
they are more attractive when solving large-scale problems. 
The convergence and recovery conditions of these algorithms 
have also been thoroughly studied, which are usually ex- 
pressed as RIP constants 1211, l25l, l23l. Typically, CoSaMP, 
SP, AIHT and OMPR can recover any fc-sparse signal provided 
with 54k < 0.35, Ssk < 0.35, Jgfc < l/%/32 and < 0.499, 
respectively l23l . 

However, the major disadvantage of the above OMP variants 
(except AIHT) is that the recovery conditions are also their 
corresponding convergence guarantees. In other words, if 
these conditions do not hold, their convergence cannot be 
guaranteed. While the RIP conditions are not difficult to satisfy 
in compressive sensing, it may be too strict to achieve in real- 
world applications. For example, in the face recognition tasks, 
several images of the same person may be highly correlated. 
In this case, dtk can be very large even for small k, where 
t > 1 is an integer. In other words, the RIP conditions will be 
violated. Consequently, the above improved OMP variants may 
not converg41l Another disadvantage of the OMP variants is 
that, most of them, such as SP, require a good estimation of the 
sparsity fc; otherwise, their performance will decrease. Unfor- 
tunately, in real-world applications, the estimation of ground- 
truth sparsity is not trivial, which restricted the applications of 
these methods. 

Regarding the computational issue of £i-convex relaxations 
and OMP type methods, and the convergence issue of the 
improved OMP variants, the core contributions of this paper 
are summarized as follows: 

• We propose a new convex relaxation for sparse recovery. 
To solve the new relaxation, a general matching pursuit 
(GMP) algorithm, which includes the standard OMP as a 
special case, is introduced. Since GMP solves a convex 
problem, its convergence is guaranteed. In addition, GMP 
does not need to specify the estimation of the target 
sparsity, which is desired in many real-world applications. 
Finally, a subspace exploratory matching is presented to 



further improve the performance of GMP. 

• We show that with £1 regularizer, GMP guarantees to 
recover any fc-sparse signals if Sk < 0.307 — i^, where 
v can be arbitrarily close to 0. Furthermore, under mild 
conditions, GMP can exponentially converge to the op- 
timal solution. A collection of numerical experiments 
demonstrate the effectiveness and efficiency of GMP on 
sparse recovery problems. 

• The sure convergence facilitates GMP for many real-world 
applications. In this paper, we apply GMP to face recogni- 
tion tasks. Experimental results on two well-known face 
datasets, namely Extended using and AR databases, show 
that GMP with batch mode implementation achieves 
better recognition rates than L2 and 'Ll-'Ll methods with 
comparable computational cost. Particularly, it can be up 
to 500 times faster than the state-of-the-art £1 solver 

The rest of this paper is organized as follows. In Section |2] 
a new convex relaxation for sparse recovery is presented. 
To solve this relaxation, in Section |3] we propose a general 
matching pursuit algorithm with detailed analysis. The related 
work is discussed in Section |4] Numerical experiments and 
the application of GMP to face recognition are shown in 
Section|5]and|6l respectively. Conclusive remarks will be given 
in Section |7] Besides, all the technical proofs of this paper are 
shown in supplementary files. 

2 A New Convex Relaxation Model for 
Sparse Recovery 

2.1 Notations and Preliminaries 

Throughout the paper, we denote the transpose of vec- 
tor/matrix by the superscript ^, as a zero vector and diag(v) 
as a diagonal matrix with diagonal entries equal to v. In 
addition, we denote ||vj|p and |jv|| as the ^p-norm and ^2 -norm 
of a vector v, respectively. For a function /(x), we denote by 
V/(x) and 9/(x) as the gradient and subgradient of /(x) at 
X, respectively. For a sparse vector x, let the calligraphic letter 
T = support(x) = {i\xi 7^ 0} e {1, ...,m} denote its support 
ITtII and X7- be the subvector indexed by T. Furthermore, let 
T'^ be the complementary set of T, i.e. = {!,..., m}\T. 
Finally, let A B represent the element-wise product of two 
matrices A and B. We also need the following definition of 
Restricted Eigenvalue Condition, Restricted Condition Number 
and Restricted Set. 

Definition 1. 4791/ Given an integer k > 0, a design matrix 
A is said to satisfy the Restricted Eigenvalue Condition at 
sparsity level k, if there exists positive constants ^-(A,k) 
and 7+ (A, fc) such that 

fx^A^Ax 1 

7_(A, fc) = inf ^ ,x / 0, l|x|lo < fc , (5) 



' X A Ax 

7+(A,fc) =sup<j ,X7^ 0,||x||o < fc J> . (6) 



In addition, the Restricted Condition Number is defined as 

7+(A,fc) 



k(A, fc) = 



(7) 



1 . An experiment on a non-RIP toy problem will be given in Section 15.2.31 
(shown as in Fig. [5) to demonstrate this issue. 



7_(A,fc)- 

Lemma 1. Let k < fc', then we have 7+ (A, fc) < 7+ (A, fc'), 
7_(A,fc) > 7_(A, fc') and K(A,fc) < K(A,fc'). 
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Definition 2. Restricted Set l\34]l . Given an index set T, the 
restricted set is defined as Td = {h € R™ : ||h7-c||i < 
-D||h7-||i}, where denotes the complementary set ofT. 

2.2 A New Convex Relaxation Model 

Notice that SR can be considered as a support detection 
problem 1231 . Motivated by this fact, we introduce a support 
selection vector r e {0, 1}™ to select the atoms, where the ith 
atom is selected if and only if = 1. In addition, to enforce 
the sparsity, we impose a sparsity constraint ||t||o < g < k 
to control the number of selected atoms. Here we use g to 
distinguish the target sparsity k in equation (1) for that g is 
only a conservative estimation (namely, g can be several times 
smaller than k) to the ground-truth sparsity, which is often 
unknown in advance in real-world problems. 

With the introduction of the support selection vector r, we 
now give a new model for SR. For simplicity, let A = {x : 
\\t\\o ^ Qt't & {0,1}™} be the domain of t and for each 
r G A, Adiag(r)x = A(x x). Let ^ = b - A(r x) be 
the regression error, based on the formulation (2], we consider 
to solve an alternative model: 

Ml: minmin A||x||i + ijl^H^ (8) 

TSA 2 

S.t. ^ = b - A(x0t), ||t||o < p. 

It is also interesting to solve the following problem by setting 
A = 0. 

M2: minmin i||£|P (9) 

s.t. £ = b - A(x0 r), ||t||o < £1. 

As A/2 is a special case of Ml, hereafter we will not 
differentiate them unless specified. Note that in both ^ 
and (|9]l, there are X]f=o (T) feasible t in A. The tasks of 
the optimization problems in dHJ and (|9]l are to find the 
BEST T from the feasible set to minimize the regression 
loss 1 141 p. Unfortunately, these two optimization problems 
are very difficult to solve due to the exponential number of 
constraints. To make them solvable, we firstly conduct the 
following transformations. 

Proposition 1. By introducing the dual form of the inner 
minimization problem. Problem dS]) can be reformulated as 

min max — i llalP + a^b (10) 
T ex. 2 

s.t. ||Q:^ArfM^(T)||oo < A, ||t||o < £1, 

where a. ^ ^ is the dual variable to the constraint ^ = b — 
A(x t). If \ = 0, the constraint becomes o.^ Adiag{T) = 
Oi\\t\\o < g for M2. Finally, let x* be the optimal solution 
of problem Ml or M2, then x* = if Tj = 0. 

In Proposition [T] we have a = ^ at the optimality, 
which is particularly important in the following derivations. 
At first, as the regression error ^ is always bounded, we 
have 4 = b for x = 0. Without loss of generality, we 
can define a feasible domain for a w.rt. a feasible x as 

= {a : ||aTAdiag(x)||oo < A, a e [-1,1]''} to 
Here Z > is a large number such that the optimal solution 
a* always exists and is achievable in the defined domain. In 
addition, we define: 

/(a,T) = - a^b, aeAr- (11) 



Moreover, for a given r, define by maxQ- —f{a., t) as 
the maximizer over a and miiixeA maxcK — /(a, x) = 
minxeA(niaxQ; — /(cc, t)) as the minimizer among all pos- 
sible r e A. Since |A| is finite, the minimizer is always 
achievable by enumeration. However, this enumeration process 
will become computationally intractable for large m and g, 
in which |A| becomes extremely huge. Fortunately, we can 
make the convex relaxation to (fTol i according to the following 
theorem. 

Theorem 1. (Convex relaxation) With and A defined 
above, we have 

min max — f (a. t) > maxmin — ffa. r). (12) 

reA a .< v , / _ ^ v - / 

Notice that the latter problem in formulation (ITZt is the 
convex relaxation to min^eA maxa —f{a, t), which can be 
further transformed as a QCQP problem. 

Proposition 2. (QCQP for SRj. By introducing a new variable 
e R, maxQ minx FA —fjcx, t) is equivalent to a convex 
QCQP problem l[35\l : 

min 9: s.t. f(a.T)<6, V r G A. a G yl^ (13) 

In the rest of the paper, we will focus on solving the QCQP 
problem (flJl i for SR. Unfortunately, since the number of 
constraints is as many as J2i=o (T)' ^^'^ become very huge 
for median size m and g, making (T3[ intractable. Therefore, 
how to solve it efficiently is our core concern in the following 
sections. 

3 General Matching Pursuit 

Although there are J2i=o ( T) constraints defined in A, the 
intrinsic sparsity of the SR problem ensures that the number 
of active constraints is very small. Ideally, for the fc-sparse 
signal X, there exists at least \k/ gl optimal r's such that the 
support of X can be included by these r's. In other words, 
the number of active r's is very small. The question is: how 
can we gradually infer these active r's from the exponential 
number of candidates? 

Notice that, problem ( fT3] l can be considered as a special case 
of the semi-infinite programming (SIP) problem II36I . Since 
the SIP problem can be efficiently addressed by a central 
cutting plane (CCP) algorithm 1361 . we seek to address (fT3T l 
by adopting CCP algorithm. Follow ll36l . the details of the 
CCP algorithm for solving (fT3] l are described in Algorithm [T] 



Algorithm 1 Central cutting plane method for (fT3l . 

0: Initialize a'^ = b, and do worst-case analysis to find tlie most-violated 
constraint tq. 

1: Let e" = f(oL°,To), Ao = {to} and t = 1. 
2: Solve the following program: 

max 5 s.t. 6 + 5 < 6''"\ f(a,Tt) < - 5.\/Tt e At_i. (14) 

3: Let (a', 9', 5') be the solution to I14t . If the stopping condition 
achieves, stop. 

4: Obtain a new violated constraint xt by worst-case analysis. 

5: Set Af = At_i U {xt}. If /(Q;',xt) > 6*', that is, a* is an infeasible 
solution to )13t . and then add the constraint f{ct, xt) < 9 — S to SDt. Let 
i = i + 1 and go to Step 2. 
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The basic idea of Algorithm [T] is: instead of solving the 
SIP with all constraints defined by A, it iteratively finds one 
possible active constraint through worst-case analysis and 
adds it into the active constraint set At that is initialized as 
0. After that, the master problem (Il4t with active constraints 
will be solved. An important issue for the worst-case analysis 
is the initial guess of q;°. Typically, since x'' = holds at the 
initial stage, we have ^ = b. Accordingly, the dual variable 
a" can be set by a" — b. Interested readers can refer to ll3Tl . 
Il36l for the detailed discussions of this algorithm. The details 
of the worst-case analysis, the solution to the master problem 
and the stopping conditions will be discussed in the following 
subsections. 



3.1 Worst-Case Analysis 

In Algorithm 1, the first step is to find the most active r 
from the exponential number candidates, which is named as 
the worst-case analysis. Recall that, at the optimality of ( fT3] l, 
the following conditions should hold: 



|a^Adiag(T)|U < A. 



(15) 



Let g = A^a, apparently, any atom with \gi\ > X violates 
the optimality condition and the atom with the largest \gi\ 
violates the condition the most. Since |t|o < g, the g atoms 
with the largest \gi\ will form an active constraint. To be more 
specific, to obtain the most-active Xf, we can set the g entries 
of Tt with the largest \gi\ to 1 and the rests to 0. 

Recall that this strategy is akin to the matching step in 
traditional MP algorithms 124), ||23l, ||26l. Therefore, we refer 
Algorithm 1 as general matching pursuit (GMP) algorithm. 
The effectiveness as well as the deficiency of the worst-case 
analysis will be further studied later 

3.2 Master Problem Optimization 

In Algorithm 2, the second step is to solve the master problem 
(O with the selected active constraints. To begin with, we 
make the following notations. At first, let J'i be a index set 
of atoms selected by r^, where includes g active atoms. 
In the worst-case analysis, it is easy to ensure that there is no 
overlapping elements among J'i, for all i. In the following, let 
~ ^iJi be the index set of all selected atoms at the tth 
iteration. Finally, let T = |At| be the size of active constraints 
for the tth iteration, we have T < t, where the inequality 
holds if some non-active constraints are deleted from At during 
the optimization f36l. The following lemma indicates that the 
master problem (fl4t can be solved in the dual. 

Lemma 2. In the master problem ( 1741 ) of the tth iteration, let 
{v} and {fti} be the optimal dual variables, and e 11 = 

{^l\^Ji h 0,J2Li = problem ^ can be solved in the 
following dual form: 



mm max 
fien cteA 



(16) 



Unfortunately, it is very difficult to solve the minimax 
problem ( fT6] l directly. Particularly, the computational cost is 
very expensive if n is very large. Motivated by the fact that, 
problem (fTST i only relates to a small set of active atoms, namely 



\It\ = Tg <C n, we seek to solve it w.rt. the sparse primal 
variable x. Fortunately, the following theorem provides such 
feasibility. 

Tlieorem 2. Suppose no overlapping elements exist among Ji 
and If = y^iJi, problem ( 1761 ) can be addressed by solving an 
£i-regularized problem with variables ^ and xw.rt. the subset 



min A||x||i 

^4 



s.t. ^ = b — Ax, xic 



(17) 



If X = 0, it becomes a least square regression problem. 
Furthermore, cx can be recovered by cx = 

3.3 GMP in Primal 

Now that the master problem can be addressed w.rt. variables 
X and ^, we can rewrite Algorithm 1 in the primal form. 
Notice that Algorithm 1 includes two-layer loops: the outer 
loop for CCP iteration and the inner loop for solving the master 
problem. To differentiate the two loops, we use t as the outer 
loop index and k as the inner loop index. Furthermore, we 
denote by x* the outer iteration variable and u'^' the inner 
iteration variable. Recall that the master problem is defined 
on the atoms indexed by It, with the relation a = ^, GMP in 
Algorithm 1 can be implemented in the primal form, which is 
shown in Algorithm |2] 

In Algorithm |2] the master problem is solved by prox- 
imal gradient (PG) or conjugate gradient descent (CGD) 
method 1381 . Il39l , which will be depicted later Moreover, 
to accelerate the convergence, when doing master problem 
optimization, we set u^^ = x^~^ as a warm-start. Once the 
inner optimization is finished, we can set x^^ = u^^ and 
x^c = for the next warm-start, where vl^^ is optimal solution 
(or approximate optimal solution) obtained by PG or CGD 
iterations. It is worth mentioning that the computational cost 
for the master problem optimization takes only 0{n\lt \ ), since 
it is performed on the set If. 

Algoritlim 2 GMP in primal. 

Initialize x° = 0, q;° = b, To = 0- Let t = 1. 

1: Do worst-case analysis: Let g = A^o;*"^; choose the q largest 



\gj\ and record their indices by Jt- Let It 
2: Solve the master problem: 



Ujt. 



Initialize u'^^ = x^^ ^ . 



For k — 1, 

Update using PG (A > 0) or CGD (A = 0) rules. 
Quit if the stopping condition achieves. 
End. 

3: Let x^^ = u|, x^c = and a* = ^. 

4: Quit if the stopping condition achieves. Otherwise, \ett — t+1 
and go to step 1. 

For completeness, now we present the general idea to solve 
the primal problem ([TtT i. Since it is an £i -regularized problem, 
any existing £i solver can be adopted to solve it. In this paper, 
we choose the projected gradient (PG) method since it has 
attractive convergence preperites ll38l . Define </3(x) = ^||b — 
AxlP and 



/(x) = A||x||i+^(x) 



(18) 
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In PG, it iteratively minimizes the following local quadratic 
approximation of /(x) at a fixed point u: 

0(x, u) = A||x||i + ip{u) + V^(u)^(x - u) + ^^llx - u\\\ (19) 

where L is a positive number To iterate, PG 
reUes on the soft-thresholding operator: SL,\{o)i = 
sign(oi) max { |oi| — j;,0}- Moreover, let u*^ be the point at 
the fcth iteration of PG and g = Vi^(u'^) = A^(Au'^ — b), the 
minimizer of 0(x, u'^) can be calculated by Sl.\{u'' ~ s/L). 
Then the basic updating rule for PG can be written as: 

u'=+^ = SlA»'= - s/L) = u*-- - l/LGiu"), (20) 

where G(u'^) = L{u'' — S'l.a(u'^ — s/L)) is called as the 
generalized gradient and L can be adjusted by the Une search 
method ||38l . Interested readers can refer to Il38l for more 
details. 

When A = 0, the master problem becomes a least square 
regression problem and G(u'^) = g is the gradient of (p{u'')- 
Although it can also be solved by PG algorithm, other solvers 
such as CGD can achieve better efficiency l39] due to two 
reasons: firstly, the step size in CGD is calculated by the exact 
line search and the computation cost of the line search in 
PG can be avoided; secondly, CGD has a better factor for the 
convergence rate than PG, as shown by the following theorem. 

Theorem 3. ( 09l/ . SiMH) At the tth iteration of Algorithm 2, 
if ^^{A, tg) > or 1 — Stg > 0, (namely, the master problem 
is strongly convex w.r.t. the sub-variables indexed by T), both 
PG and CGD have geometric convergence rate w.r.t. the sub- 
variables xx- Specifically, let {u^} be the sequence generated 
by CGD or PG, u* be the minimizer w.r.t. the variables indexed 
by Xt and /(u*) be the optimal function value, ^(uO)_y[u»j — 
holds. Here, x = (\/k(Q) - !)/( ^^(0) + 1) is for CGD 
and X = ^1 ^ 4)) k(q) ) is for PG. In addition, according to 
Lemma [7] x ^''^ gradually increase with increasing t. 

Theorem |3] indicates that both PG and CGD can converge to 
the optimum after finite iterations if 7- (A, tg) > or 5tg > 0. 
Moreover, when t goes larger, the factor x will increase and 
the computation cost will increase, too. 

Now we discuss the relationship between GMP and the 
standard OMP ||26l. Recall that GMP iteratively takes the 
following three steps. At first, it includes one atom with the 
largest |a^r| into the support set by I = T U {j}, where 
r = b— Ax is the residual. After that, an orthogonal projection 
is performed by solving miux^^ ^ll^* ~ Aixijp. Finally, the 
residual r is updated by r = b — AjUj. We can observe that 
OMP is a special case of GMP when A = and g = 1. 

Remark 1. The classical OMP algorithm solves the convex 
problem l[I3\l regarding M2 with g ^ 1 and A = 0. 

3.4 Subspace Exploratory Matching 

Now we study the effectiveness of the worst-case analysis. 
First of all, it is interesting to measure the progress of each 
outer loop in GMP after the worst-case analysis. Typically, we 
have the following result. 



Lemma 3. Let G(x*) be the generalized gradient at x' and 
g = V(^(x*), JTt+i be the index set of atoms selected by the 
worst-case analysis and be the point after the first iteration 
of the inner loop, with proper line search, we have: 

/(x*)-/(u^)>^||Gj,^,(x*)||^ = ^ J2 i\9A-^f- (21) 

where \gi\ > Xfor \fi £ Jt+i. 

From Lemma |3] on the one hand, it indicates that, the 
selection of the g elements with the largest \gi \ can achieve the 
best lower bound on the objective value improvement for the 
first iteration of the inner loop. In other words, the worst-case 
analysis is effective. 



Algorithm 3 Subspace Exploratory Matching. 

0: Give dictionary A and Lj{io > 1), initialize cx — h. 
1: Calculate g = a. Choose the ujg atoms with the largest 
(jji's and record the indices in set J^j- Let X^j ~ Ttl-I Jui- 
2: Solve the master problem: 

Initialize u^^^ — x^~^. 

For k — 1, 

Update using PG (for Ml) or CGD (for M2). 
Quit if the stopping condition achieves. 
End. 

3: Sort those ujg atoms by their regressors \ui \ in descending order 
and return the first g atoms. 

On the other hand, this lower bound cannot be guaranteed 
for more inner iterations. Therefore, it may be suboptimal 
when g is relatively large, where some non-support atoms may 
be wrongly included. To address this issue, we can explore 
a larger search space. To be more specific, we can seek to 
include cog {oj > 1) elements with the largest \gi\ and then 
solve the master problem with all selected atoms. Finally, we 
only select the g atoms with larger regressors from the new 
atoms. This scheme, named as subspace exploratory matching, 
is summarized in Algorithm |3] With the subspace search, 
the convergence speed as well as the decoding performance 
can be improved. However, additional computational cost will 
be consumed to solve the master problem. Fortunately, since 
the inner master problem solver can geometrically converge, 
several iterations will be enough with warm-start. Finally, to 
distinguish the two matching schemes, hereafter we term the 
GMP with subspace search as SGMP. 

3.5 Stopping Conditions 

The stopping condition is important for MP algorithms. With 
properly selected ^ a natural stopping condition for GMP is 

||a^A||^ < A. (22) 

However, when A is arbitrarily small, GMP will converge to 
the point where ||q;|| < ||^||, where ^ is the noise. Apparently, 
this is meaningless. The same problem also exists in OMP. 
To address this issue in OMP, pTi proposes two stopping 
conditions: 

||q:^A||oo < '"oo and l|a||<r2, (23) 

2. The selection of the regularization parameter A depends on the structure 
of the noises 1401 . which is beyond the scope of this paper. 
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where 7'oo > and r2 > are pre-determined value based on 
the structures of different noises ATI . These two conditions 
are also applicable for GMP. However, in real applications, it 
is not easy to obtain the noise structures, making the setting 
of Too or r2 difficult. Therefore, they are not easy to use in 
real applications. To address this issue, we can use the relative 
function value difference as the stopping condition based on 
the CCP algorithm: 



where S' is the function value difference between the (t — l)th 
and <th iteration, e is a small tolerance and B'^ denotes 
the initial objective function value. In our problem, since 
9^ = ||b|p/2, the stopping condition becomes 2(5*/||b|p < e. 
Notice that this condition is appUcable for those methods that 
cannot monotonically converge. 

3.6 Convergence Analysis 

At first, we show that GMP can surely converge. 

Theorem 4. (Lemma 3.1 and Theorem 3.1 in l[36]l } The 

sequence {(5*} (5* > Oj generated in Algorithm 1 will converge 
to 0. Furthermore, there exists t, such that cc*"^ is feasible 
for Problem U3\l and Algorithm 1 stops in the tth iteration 
with a* to be the optimal solution (not necessarily unique) to 

m. 

It is worth mentioning that the above convergence property 
in Theorem |4] does not require any restricted conditions. 
That is to say, GMP will surely converge. In addition, we 
can further show that GMP can exponentially converge under 
mild conditions. To achieve this result, we first show that, if 
A > 0, the solution obtained by GMP satisfies the restricted set 
condition in Definition 2. 

Lemma 4. Let x* be the accumulation point of GMP with 
A > 0, X* is also the optimal solution to minx ^llb — Ax|p + 
A||x||i. Suppose T is the support of the ground truth x, let 
h = X — X*, then h satisfies the restricted set Tc = {h G 
: ||hr-||i < i:»||h7-||i}, with D^^^ and €>0. 

From Lemma ID with appropriate selected A > 0, the 
sequence {x*} generated by GMP will converge to the global 
solution of LASSO. Suppose x* is the optimal solution, /(x*) 
is the optimal function value and e is the regression error for 
LAS SO. Let I* be the support of x* and assume \I* \ < k, then 
/(x*) will exponentially approach to ./(x*) if Sk+tg < 1/2. 

Theorem 5. Let {x*} be the sequence generated by GMP 
with X > 0, we have /(x*) > C/(x*) = C(i||e||2 + 
A||x*||i), where C > 1. Furthermore, if Sk+cg < 1/2 
holds for L > 1, GMP can exponentially converge if t < 
L. Specifically, we have /(x*'''^) < i^/(x*), where v — 

When A is arbitrarily small (where A > 0), without proper 
stopping conditions, GMP will converge to the point where 
||a|| is very small. Since the regression error ^ = a holds, 
the optimal solution is meaningless for the noise cases if 
A is arbitrarily small. In this sense, we suppose GMP is 



stopped once x* sufficiently approaches to the ground truth 
fc-sparse signal x. Let e be the ground-truth additive noise, 
the following theorem indicates that GMP will exponentially 
approach to /(x) = 

Theorem 6. Let {x*} be the sequence generated by GMP 
with arbitrarily small A(A > 0), suppose /(x*) > C/(x) = 
f||e|p (where C > 1), if Su+^g < 1/2 and t < l, GMP 
can exponentially converge, namely /(x*+^) < i^/(x*), where 

- - (i - M iJ^%'^tr:l]J (' - *)^) « — ■ 

Remark 2. The convergence rate in the traditional CCP 
is unknown Ai6l/ . Therefore Theorem \5\ and |6] improve the 
convergence result in / |36|/ . In addition, to prove Theorem |5] 
the exact solution to the master problem is not necessary. 
Instead, a good approximation is sufficient. Furthermore, with 
warm-start, several iterations are enough for the inner loop. 
Finally, to achieve the exponential convergence rate, g is 
desired to be several times smaller than k. 

The convergence analysis for GMP above matches the re- 
sults in 1271 . where the geometric convergence rate of OMP 
type methods under restricted conditions has been revealed. 
However, GMP is much faster than OMP type methods in 1271 . 
||28l , ||29l since it can reduce the number of matrix- vector 
product while maintaining the geometric convergence rate. 
Notice that, the motivation of GMP is much different from OMP 
type methods. Typically, the OMP type methods are based on 
the loss function and it is not straightforward for them to solve 
model (|8]l with £i-regularization (where A > 0). 

3.7 Sparse signal recovery guarantees 

The sparse signal recovery conditions, which also guarantee 
the uniqueness of the global solution, are important in the 
context of compressive sensing, which is usually expressed 
in terms of RIP constant. Based on Lemma IH the following 
recovery guarantee holds for GMP. 

Theorem 7. GMP with proper A > will recover the signals 
if the restricted isometry constant ak < 0.307 — v, where v 
can be arbitrarily close to 0. 

When A is arbitrarily small, particularly for A = 0, a 
good RIP condition for the exact signal recovery is difficult 
to achieve, which we leave for future study. However, in 
practice, it has similar performance with GMP under properly 
selected A, which will be shown in the experiments. By 
comparison, CoSaMP, SP, AIHT and OMPR can recover any 
fc-sparse signal provided with 5ik < 0.35, Ssk < 0.35, 
Ssk < l/\/32 and 62k < 0.499, respectively However, 
except AIHT, the above conditions are also the corresponding 
convergence guarantees. In addition, to achieve the guaranteed 
performance, a good estimation of the sparsity fc is necessary 
for them, which restricts them for many applications. 

4 Related Work 

In recent years, many fast £1 -algorithms have been developed. 
Among them, the alternating direction method (ADM) and the 
augmented Lagrange multiplier (ALM) method have shown 



7 



good performance on solving problem (l3) with noise; 
while the fast iterative shrinkage-threshold algorithm (FISTA) 
has shown good performance on solving the LASSO problem 
(|4]l ||42| . To improve the efficiency, an elegantly designed 
parallel coordinate descent method, named as Shotgun, 
has been proposed [|43l to improve the efficiency. Another 
recent work, named as S-Ll, uses a screening test to predict 
the atoms with zero weights and reduces the computational 
cost with random projections [44 1. Furthermore, a proximal 
gradient homotopy (PGH) method is proposed to solve a 
series of strongly convex problems by gradually decreasing 
the regularization parameter from an initial guess Ao to the 
target A such that the geometric convergence rate can be 
maintained for each subproblem lfT9l . However, the efficiency 
of the above mentioned £i methods is still limited due to the 
frequent calculations of Ax and A^^ (both of them scales 
0{ran)). On the contrary, GMP solves the LASSO problem 
using a greedy strategy, and only calculates A^^ several 
times to select the most-active atoms. In addition, each master 
problem optimization maintains a geometric convergence rate 
and scales 0{\I\n), where I is the index set of the selected 
atoms. Therefore, with warm-start, GMP can be much more 
efficient than the ii methods which needs frequent calculation 
of Ax and AJ Finally, when tackling a large number of 
signals with medium size m, the computational cost can be 
further reduced using a batch mode implementation, which 
will depicted in Section |6] 

5 Numerical Experiments 

In this section, we will compare the performances of GMP and 
SGMP with the following baseline methods: 

• Four well-known greedy pursuit algorithms: OMP, AI HlJl, 
SP0 and OMPR are included as baseline methods. In 
OMPR, it needs to calculate z = x+77A^(b— Ax), where 
Tj is learning rate ll23l . The setting of rj is very important 
for its performance ||231 . In ||231 . a feasible range for rj is 
provided if A satisfies the RIP condition. Unfortunately, 
if A is not well scaled, the scale of A^ (b — Ax) may 
vary a lot and the setting of rj will be difficult^. Motivated 
by GMP, we can adaptively set -q using the CGD rule. To 
distinguish, we name OMPR with the adaptive parameter 
setting as the OMPRA. 

• Four state-of-the-art ^i-solvers: Shotgur@ is a fast par- 
allel coordinate descent solver which is implemented in 
C++ 1431 . F I S T^l^l uses the accelerated proximal gradient 
method with continuation technique P31 . B6l : PGIr|luses 
homotopy method to improve the convergence speed |[T9l : 
S-L10 adopts a screening test to predict the zero entries 
to improve the decoding efficiency |44|. 

In the experiments. Shotgun is conducted on Intel(R) 
Core(TM) i7 CPU (8 cores) with 64-bit Linux OS, while 

3. http://www.personal.soton.ac.uk/tblm08/publications.html. 

4. http://sites. google. coni/site/igorcan"on2/cscodes. 

5. Interesting readers can refer to 1231 for the detailed discussions on rj. 

6. http://www.select.cs.cmu.edu/projects. 

7. http://www.eecs.berkeley.edu/~yang/software/llbenchmark/index.html. 

8. http://www.eecs.berkeley.edu/~yang/software/llbenchmark/index.html. 

9. http://www.princeton.edu/~zxiang/home/index.html. 



the other methods are conducted on 64-bit Windows op- 
erating system(OS) with the same computer configuration. 
For fair comparisons, except S-Ll, all methods are written 
in C-H-. For S-Ll, which is written in Matlab, we run 
it in parallel mode on the multi-core machine. A Matlab 
interface of the C-H- implementations is available on website: 
https://sites.googIe.com/site/sparsecodingprojects/home. 

5.1 General Experimental Settings 

The numerical simulation settings follow the strategy in ||25]| , 
|1T9| . Typically, we study compressive sensing problems based 
on Gaussian random matrices of two different scales, namely, 
A e X 2 ■ ^jjj A G ' ^ ^ .In addition, three types 
of sparse signals. Zero-one signal (denoted by and each 
nonzero entry is either 1 or -1), Uniform signal (denoted 
by Su and each nonzero entry is sampled from a uniform 
distribution 1)) and Gaussian signal (denoted by Sg and 

each nonzero entry is sampled from a Gaussian distribution 
A/'(0, 1)) will be studied. The measurement b is generated by 
b = Ax + ^ with Uniform noise sampled from [—0.01,0.01]. 
The decoding performance and decoding time are reported 
to demonstrate the performance. Here the decoding perfor- 
mance is measured by the empirical probability of successful 
reconstruction (EPSR), which denotes the ratio of successful 
decoding times over M independent experiments ll25l . For £i- 
methods, the de-biasing technique (implemented by CGD) is 
used to improve the decoding performance ll45l . B6l . 

We set A to 0.005| |A^b||oo for methods by default, 
which is suggested by many li packages 1451 . For AIHT, 
SP OMPR, the estimation of the sparsity k is required. In the 
simulation, where we know the ground-truth k, we estimate 
it by /c = 1.2k, where k is the ground-truth. For GMP and 
SGMP, motivated by the fact that li methods can recover any 
k sparse signals with n = 0{k\og{m)) measurements HI, ||4J, 
|[T3l . we intuitively set g for GMP and SGMP by 

Q = n/{r\og(m)), (25) 

where r > 3. In those applications where k is known in 
advance, g can be set by g = k/r with r > 6. For SGMP, we 
set the subspace search length parameter uj — 4. The learning 
rate rj in OMPR is set to 0.7. Finally, we stop all the algorithms 
once ii^ip < 1.0 X 10~^ for fair comparisons. We keep the 
default settings of other parameters for baseline methods as in 
their original implementations. 

The numerical experiments are organized into 2 parts. 
Firstly, the convergence properties of GMP with A > 
are studied. Secondly, thorough comparisons are conducted 
between GMP and other baseline methods. 

5.2 Performance of gmp (A > 0) 

5.2. 1 Influences of g on gmp 

In this experiment, we study the influences of g on 
(S) GMP on Gaussian random matrix A e x2 • ^ Since 
rt/log(TO) = 113, we set a baseline estimation for g by 
p = [??/(41og(TO))] = 28. We also test the other two 
parameters: \p/2] and 2g. Moreover, a 160-sparse Zero-one 
vector X is generated as the ground truth. Finally, PGH is 
adopted as the baseline method. 
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Fig. 1. Convergence properties of gmp(mi) and pgh 
on A e ' , where x is Zero-one sparse vector of 

sparsity ||x||o = 160. For gkp {g) , g is set to 28. Tine 
computation time (in seconds) for various metlnods is: 
0.921, 0.172, 0.156, 0.109, 0.109, 0.124, 0.094, 0.686 
and 0.219, respectively. 



Fig. |l(a)| and Fig. |l(b)| present the objective value evolutions 
w.r.t. the total proximal iterations and the outer iterations, 
respectively. The computation time of various methods are 
listed in the figure caption. From Fig. [T] we have several 
observations. First of all, ( S ) GMP with different g can achieve 
faster convergence rate than PGH. However, SGMP generally 
converges faster than GMP under the same g, which verifies 
the effectiveness of subspace exploratory search. In addition, 
although GMP converges faster with larger g at the beginning, 
it may have slower convergence speed at the final iterations. 
More critically, the total computational cost may increase a 
lot. For example, GMP needs 0.686 seconds to converge. The 
reason is that if g is too large, some irrelevant atoms may be 
wrongly included, and the computational cost of inner problem 
will also increase. For smaller g, although the number of 
outer iterations will increase, the computational cost for the 
inner problem is small. Accordingly, the total computational 
cost may be very small. Recall that GMP requires 0{mn) to 
calculate A^o; in each outer iteration, a good setting of g 
that well trades off the cost of A^cc and the master problem 
optimization is important. As shown in Fig. [T] the estimation 
rule given in equation dZSl l shows promising results. 



5.2.2 Influences of \ on gmp 

GMP can produce different matching pursuit algorithms using 
arbitrarily small A. To verify the performances of different A, 
we set a baseline A by A = 0.005||A^b||oo- Besides, two 
smaller A, 10~^A and lO^^A, are also studied. In addition, 
we set g = 7i/(41og(m)) for all GMP. PGH and FISTA are 
adopted as the baseline methods. Two types of signals, namely 
the Zero-one signal and Gaussian signal are studied. For each 
type of sparse signal, we vary the sparsity k to obtain different 
sparse vectors. For each fc, we run I\I = 100 independent 
trials. 

The decoding performance and decoding time of each algo- 
rithm are reported in Fig. |2] From Fig. |2] all methods attain the 
similar decoding performance on Zero-one signals. However, 
GMP with various A shows much better efficiency than PGH 
and FISTA. In addition, SGMP shows sUghtly better efficiency 
than GMP with larger k. On Gaussian signals, ( S ) GMP with 
various A show better decoding performance than PGH and 
FISTA. Similar to Zero-one signal, GMP with various A also 
show much better efficiency than PGH and FISTA for a broad 
range of k. Finally, SGMP also outperforms GMP in decoding 
performance, which verifies the effectiveness of the subspace 
search. 
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Fig. 2. Sparse reconstruction results on A e 



From Fig. |2] FISTA shows similar decoding performance 
to PGH, but attains worse efficiency in all cases. Therefore, 
in the following comparisons, we will exclude FISTA for 
space issues. At last, in order to do comparisons with other 
MP algorithms, only GMP with A = will be studied later, 
which makes no harm to the completeness of the comparsion 
since GMP with a wide range of A shows close performance. 
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Fig. 3. Convergence comparisons on non-RIP problems. Tine number in bracket in Fig. 3(a) and 3(b) denotes the 
estimation of sparsity k or g. The sparsity of tine solutions obtained by gmp and sgmp are both 72 when converge. 



5.2.3 Convergence on Non-RIP Problems 

According to Theorem |4] GMP can surely converge. On the 
contrary, some OMP variants only converge under restricted 
conditions. To demonstrate this issue, we compare the per- 
formance of GMP with SP, CoSaMp, AIHT and OMPRA on 
a synthetic problem, where the restricted condition does not 
hold. Typically, we generate an non-RIP problem by setting 
A(:,41 : 80) = A(:, 1 : 40) (in Matlab notation), where 
the original A G is Gaussian matrix. In this case, 

the RIP condition doest not hold since there are correlated 
columns in A. Finally, a 40-sparse ground truth x is set by 
letting x(l : 40) = 1 and x(41 : 8092) = 0. The response 
vector b is generated by b = Ax without noise. Notice that 
this setting may happen in the face recognition task where 
similar or same images of a person can appear in A for many 
times. 

The convergence behaviors of various methods are shown 
in Fig. [3] From Fig. |3(a)| only GMP and SGMP achieve the 
global solution (not unique in this case), while other methods 
can only obtain a local solution. Fig. |3(b)| shows the objective 
value difference of successive iterations, where we can observe 
that, only AIHT, GMP and SGMP can monotonically decrease 
in objective value (since the difference is always greater than 
0). The comparisons above verify the previous arguments on 
the convergence of the improved OMP variants, namely, these 
algorithms require restricted conditions to converge. On the 
contrary, GMP can surely converge. 

To further demonstrate the convergence when RIP con- 
dition does not hold, we show an additional example on 
ro-one signals. Here we choose 
large k from {300, 310, 320, 330, 340, 350, 360} such that the 
RIP condition for the exact signal recovery will not hold for 
any algorithms. Therefore, all algorithms cannot successfully 
recover the original sparse vector. For simplicity, we only 
record the final objective value /(x*) = ||b — Ax*||/2 with 
different k in Fig. |3(c)| From Fig. |3(c)[ the regression error 
for (S)GMP is very small, indicating that (S)GMP indeed can 
converge to a global solution w.r.t. a such that the regression 
error is minimized. On the contrary, both SP and AIHT can 
only get local solutions. 



5.3 Sparse Recovery Performance Comparisons 

In this experiment, we test the decoding performance and 
efficiency of all methods on a relatively small problem A e 
r2 x2 ^ where Shotgun and S-Ll work in parallel. For 
each k, we run M = 100 independent trials. The empirical 
probability of successful reconstructions and the decoding 
time for the three kinds of signals are presented in Fig. 5] 
where the following observation can be obtained. Firstly, on 
Zero-one signal, SGMP and the ^i-based methods show the 
best decoding performance while SP shows better decoding 
performances over other MP methods. Secondly, on Uniform 
and Gaussian signals, SGMP and OMP show relatively better 
decoding performance than other methods. Morever, in all 
cases, SGMP shows better decoding performance than GMP. 
Thirdly, for AIHT, since it is easy to get stuck into local 
minima, it shows relatively worse decoding performance than 
other methods. Also, MP algorithms are faster than Shotgun, 
a well designed parallel £i method. Finally, although OMPR 
shows relatively better RIP condition ||23l, it is sensitive to its 
learning rate rj according to our study. From the experiments, 
OMPRA that uses adaptive learning rate improves OMPR greatly 
on all problem settings both on the decoding performance and 
decoding efficiency, but it is still worse than GMP and SGMP. 
Finally, in general, PGH shows the best efficiency than other £i 
methods, such as Shotgun and S-Ll, but it performs worse 
than (S)GMP. 

In the final experiment, we focus on the scalability of 
various methods on a larger problem A G . Here only 

the Gaussian signals are studied. For computational issues, we 
only compare SP, AIHT, GMP, and SGMP with 10 trials for 
each parameter k. Both decoding performance and time of 
various methods are shown in Fig. |5] Similar observations 
are also obtained to the above small-scale problem. Again, 
GMP and SGMP achieve the best decoding performance over 
all methods. 

6 Face Recognition by Sparse Recovery 

6.1 Problem Setting 

Recently, face recognition by SR has been widely studied by 
many researchers ||2], ll46l . ||481 . The basic assumption of SR 
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for face recognition is that any testing image lies in the sub- 
space spanned by the training images of the same person ||2l, 
II2TI . 1461 . As the number of training images per class is 
usually small, the testing image is assumed to be sparsely 
represented by the training images. Mathematically, given a 
testing image b, it is assumed to be sparsely represented by the 
training image set A e by solving problem (|3]l, where 

n is the number of pixels and m is the number of images. 
However, directly solving (|3) is computationally expensive 
when 71 is very large [ID, ET\ . ll46l . To address this issue, 
a random matrix $ € < n) is introduced to reduce 

the dimension of images JJ). 



However, several recent papers llSTI . Il22l argued that the 
sparsity is not necessary on improving the recognition perfor- 
mance II2TI . II22I but increases the computational cost 1201 . 
| |2T| . Specifically, in 1211 . the authors suggested that directly 
solving ||b — Ax|p without any dimension reduction, denoted 
as L2, can achieve better recognition rates and faster predic- 
tions; while in l22l . the authors showed that solving a least 
square problem ||b - Ax|p + fHxlp, denoted by L2-L2, 
can achieve better recognition rates. The efficiency of L2 
and L2-L2 is the major advantage over SR. In L2, since 
X = R^^Q^b, one can pre-compute R~^Q^ and store it in 
memory, where A = QR denotes the QR decomposition and 
R^^ denotes the pseudo inverse. Then fast predictions can 
be achieved using simple matrix-vector product. Similarly, in 
L2-L2, as (A^A + AI)^^ can be pre-computed, the solution 
X = (A^A + AI)~^A^b can be calculated very quickly. 
The second critical issue for £i-method is the difficulty on 
the trade-off setting between sparsity and reconstruction error 
||b — Ax IP by tuning the parameter A l20l . Without proper 
setting of A, the enforcing of high sparsity may even degrade 
the performance ET\ . Il20l . Regarding the these issues, two 
questions are raised. 

• Can we reduce the difficulty of the trade-off parameter 
settings in ii methods? 

• Does sparsity really help to improve the performance? 
Before answering the above questions, we first introduce the 
batch-mode implementation of GMP. 
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6.2 Batch Mode gmp 

Recall that L2 and L2-L2 methods pre-compute the matrix 
inverse (or pseudo-inverse) to achieve fast predictions for a 
batch of testing images. The question is: can we do the similar 
implementations for GMP? 

Recall that, in GMP, the master problem scales 0(|I|n) 
complexity and the worst-case analysis scales 0{mn), where 
I is the index set of the selected atoms. Apparently, the worst- 
case analysis dominates the complexity of the whole algorithm 
if n is very large. Fortunately, motivated by |47 |, this compu- 
tation burden can be significantly reduced using a batch-mode 
implementation BGMP. Notice that A^a = A^(b - Axxx) 
can be calculated via 



A ' a = [A ' b] - [A ' Ai]xi. 



(26) 



If we pre-compute [A^b] and Q = A^A and store them 
in the memory, we can calculate AJ cx by AJ a = [A^b] — 
Qjxi, where Qx denotes the columns of Q indexed by I. 
Now the computational cost for A^a reduces to 0{m\I\), 
which will save a lot of computation when k n BTI . With 
this strategy, when there are a lot of signals to be decoded, 
the computational cost can be much reduced. However, since 
A^A needs 0{m?) space for storage, this strategy is only 
applicable for medium size problems. If m is extremely large, 
Q will be too large to store. In this case, one cannot store the 
matrix inverse for L2-L2 and L2, either. 

To verify the effectiveness of BGMP, we firstly compare 
its performance with BOMP, which also uses the batch-mode 
implementation 1471 . In the simulation, we generate a Gaus- 
sian random matrix A G .In addition, we generate 
Gaussian sparse signals with sparsity k from {400, 450, 500, 
550, 600}. For each fc, 200 sparse signals are generated. 
Finally, the measurements b is produced by b = Ax + ^ with 
Gaussian noise sampled from A/'(0,0.05). The total time (in 
seconds) spent by BGMP and BOMP on 200 signals as well as 
the Mean Square Error (MSE) are reported in Table [T] Table 
[1] shows that BGMP is about 7-16 times faster than BOMP. 
Moreover, it gains better or comparable MSE than BOMP for 
all k. 

TABLE 1 

Efficiency Comparison Between bgmp and bomp 





k 


400 


450 


500 


550 


600 


BOMP 


Time 


434.27 


546.70 


680.96 


835.72 


1014.93 


MSE 


8.61E-04 


').89E-04 


l.llE-03 


1.25E-03 


1.40E-03 


BGMP 


Time 


55.06 


55.79 


56.79 


59.51 


59.91 


MSE 


3.72E-04 


3.55E-04 


3.40E-04 


3.49E-04 


3.38E-04 




Speedup 


7.89 


9.80 


11.99 


14.04 


16.94 



6.3 Experimental Results 

Now we apply BGMP to face recognition. In addition, since 
thorough comparisons between £i methods and other methods 
have been done in ||2l, ESI, ll48l . in this experiments, we only 
adopt L2 and L2-L2, PGH and BOMP as the basehne methods. 
Notice that those MP algorithms that require an exact guess of 
k, such as SP and OMPR, are not suitable for face recognitions. 
Two reasons account for this. Firstly, for general problems, it 
is difficult to set k. On the one hand, if k is too small, the 
reconstruction error will be large and the performance will 



degrade. On the other hand, if k is too large, the computational 
cost will increase. Secondly, the training image set A may not 
satisfy the RIP conditions, hence these algorithms may not 
converge or only converge to a local solution. 

Two widely used databases, namely, the Extended using and 
AR database are studied. The Extended using consists of 2,414 
frontal face images of 38 subjects ET\ . Il48l . They are captured 
under various lighting conditions and cropped and normalized 
to 192 X 168 pixels. In our experiment, we take 62 images 
per person, resulting 2,356 images in total. The AR dataset 
consists of over 2,600 frontal images of 100 individuals il49]| . 
|2J, 148|. Each image is normalized to 80 x 60 pixels. We set 
the stopping condition ||b - Ax|p/||b||^ < 10"^ for BGMP 
and k = 200 for BOMP. Furthermore, considering that there 
may be some images that cannot be sparse-represented by the 
training images, we constrain the sparsity k < 600. Finally, 
since the number of measurements n can be very large, the 
estimation of g in equation ( |25] | is not suitable. Therefore, we 
directly set g = 10 for all experimental settings. 

6.3. 1 Face Recognition with Different Dimensions 
Followed by 1211 . we randomly choose half images of each 
person as training set and the rest images as testing set. To 
study the influence brought by the number of measurements 
(pixels), we down-sample the images with sampling rate 
Pd, where pd is chosen from {1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7} 
for YaleB images and {1, 3/4, 2/3, 1/2, 1/3} for AR images. 
Obviously, the dimension of new image vector will be p^ 
of the original image vector. The prediction accuracies on 
YaleB and AR images are shown in Table |2] To measure the 
difference of the result, the Wilcoxon test with 5% significance 
is conducted between BGMP and the winner of L2 and L2-L2, 
and 1 indicates the significant difference. 

From Table |2] we can observe that, on YaleB dataset, 
BGMP shows significantly better accuracy than L2 and L2-L2 
methods under down-sampling rate 1/5,1/6 and 1/7 and 
comparable or slightly better performance under other down- 
sampling rates. On AR dataset, BGMP performs significantly 
better than L2 and L2-L2 under down-sampling rate 1,3/4 
and 2/3. Moreover, BGMP and BOMP can achieve more stable 
performance on a wide range of down-sampling rates than 
L2 and L2-L2. For L2 method, it shows very unstable 
performance under some sampling rates. For example, on 
AR dataset, it only achieves 73.23% testing accuracy under 
down-sampling rate 1/2. The possible reason is the unstable 
pseudo inverse on ill-conditioned matrix. Its performance can 
be possibly stabilized by adding a regularization term AI to 
A^A, namely L2-L2. From Table ID L2-L2 indeed shows 
better stable performance under different sampling rates than 
L2-L2. But it performs worse than BGMP and BOMP on AR 
database with large number of measurements. 

The major complaint for £i methods is their unbearable time 
cost 121, l46l . l2n . To demonstrate the efficiency, we report the 
total time spent by various methods in Table |3] From Table |3] 
without down-sampling, PGH, the considered state-of-the-art 
£i solver, needs several hours to predict all testing images, 
which is unbearable for real-world applications. However, 
BGMP can finish the prediction within one minute, which is 
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TABLE 2 

Prediction Accuracy on Two Databases 





Extended YaleB Database 


AR Database 


Pd 


1 


1/2 


1/3 


1/4 


1/5 


1/6 


1/7 


1 


3/4 


2/3 


1/2 


1/3 


L2 


0.9876 


0.9868 


0.9831 


0.9792 


0.9371 


0.9561 


0.9621 


0.9466 


0.9301 


0.9108 


0.7323 


0.9638 


L2-L2 


0.9898 


0.9859 


0.9827 


0.9818 


0.9783 


0.9730 


0.9723 


0.9524 


0.9504 


0.9532 


0.9574 


0.9692 


PGH 


0.9897 


0.9843 


0.9826 


0.9846 


0.9815 


0.9760 


0.9658 


0.9657 


0.9650 


0.9715 


0.9679 


0.9656 


BOMP 


0.9904 


0.9897 


0.9861 


0.9844 


0.9786 


0.9799 


0.9734 


0.9742 


0.9744 


0.9738 


0.9738 


0.9619 


BGMP 


0.9911 


0.9892 


0.9873 


0.9849 


0.9817 


0.9787 


0.9761 


0.9739 


0.9757 


0.9715 


0.9723 


0.9672 


Wilcoxon 














1 


1 


1 


1 


1 


1 


1 






TABLE 3 

Total Time Spent on Two Databases (in seconds) 





Extended YaleB Database 


AR Database 


Pd 


1 


1/2 


1/3 


1/4 


1/5 


1/6 


1/7 


1 


3/4 


2/3 


1/2 


1/3 


L2 


71.33 


24.91 


6.29 


3.51 


2.42 


1.14 


0.72 


13.34 


4.39 


3.16 


3.28 


2.19 


L2-L2 


11.36 


6.85 


4.13 


2.40 


2.32 


2.22 


1.69 


3.75 


3.04 


3.10 


2.58 


1.99 


PGH 


5559.53 


4863.18 


2195.03 


1383.28 


822.11 


627.95 


383.86 


5229.75 


2812.96 


2178.91 


1324.59 


557.65 


BOMP 


139.69 


99.88 


98.05 


89.83 


89.95 


90.41 


87.60 


108.52 


98.84 


98.60 


97.25 


95.58 


BGMP 


39.72 


17.05 


12.94 


7.86 


7.62 


6.53 


6.19 


14.29 


10.87 


10.20 


7.14 


4.57 



TABLE 4 

Average Sparsity on Two Databases 





Extended YaleB Database 


AR Database 


Pd 


1 


1/2 


1/3 


1/4 


1/5 


1/6 


1/7 


1 


3/4 


2/3 


1/2 


1/3 


BOMP 


200 


200 


200 


200 


200 


200 


200 


200 


200 


200 


200 


200 


PGH 


164 


165 


165 


162 


156 


158 


163 


133 


130 


127 


135 


124 


BGMP 


167 


165 


160 


155 


155 


149 


143 


189 


190 


188 


194 


201 



comparable with L2-L2 and L2. Particularly, BGMP can be 
up to 500 times faster than PGH 1191 . and it also performs 
3-10 times faster than BOMP method. That is to say, under 
fair implementations in batch mode, BGMP can achieve the 
comparative efficiency with L2-L2 and L2. 

The remaining question is: does the sparsity help to im- 
prove the recognition performance? To answer this question, 
we Hsted the average sparsity of BGMP, PGH, and BOMP 
in Table ID Notice that the solutions obtained by L2 and 
L2-L2 are not sparse. From Table H] BGMP, PGH and BOMP 
indeed achieve sparse solutions. However, BGMP shows better 
prediction accuracy than PGH method by directly controlling 
the reconstruction error ||Ax — b|p. Moreover, from Table 
m BOMP has larger number of nonzeros than BGMP in most 
case, but it does not show much improved prediction accuracy 
over BGMP. Therefore, we can conclude that the preserving 
of the sparsity at least does not degrade the performance if 
we control the reconstruction error ||Ax — b|p to be small 
enough. More importantly, on AR dataset, BGMP outperforms 
L2 and L2-L2 with enough measurement (pixels), therefore, 
the sparsity indeed help to improve the recognition rates on 
AR dataset. 

6.3.2 Face Recognition witli Different Training Samples 
To further demonstrate whether the sparsity can help to im- 
prove the recognition rates or not, an additional experiment on 
YaleB database is conducted. Recall that in Table |2l when half 
images are used as the training set, BGMP does not perform 
significantly better than L2 and L2-L2 on YaleB database un- 
der the down-sampling rate pd ~ 1/4. But what will happen 
if the number of training images increases? Notice that with 
more training images, the matrix A^A will become more ill- 
conditioned. Let pt be the ratio of the training image number 



TABLE 5 

Prediction Accuracy on YaleB witin Different Number of 
Training Images 



Pt 


0.55 


0.60 


0.65 


0.70 


0.75 


0.80 


L2 


0.6352 


0.9350 


0.9330 


0.9684 


0.9764 


0.9815 


L2-L2 


0.9814 


0.9814 


0.9823 


0.9827 


0.9843 


0.9872 


BGMP 


0.9848 


0.9887 


0.9887 


0.9908 


0.9911 


0.9925 


Wilcoxon 





1 


1 


1 


1 


1 



over the total image number, then the ratio of the testing 
images will be 1 — pt. To demonstrate the performance with 
different number of training images, we fix the down-sampling 
rate pd = 1/4 and vary pt € {0.55,0.60,0.65,0.7,0.75,0.8}. 
Here the dimension of each image vector is 2016, 16 times 
smaller than the original image vector 

The prediction accuracy and the total time spent with 
different number of training images are shown in Table |5] and 
|6] respectively. From Table |5] BGMP performs significantly 
better than L2 and L2-L2 when pt > 0.60, which indicates 
that BGMP can achieve more stable performance when A^A 
becomes more ill-conditioned. Accordingly, we can conclude 
that the sparsity can help improve the recognition rates in 
this setting. Again in Table |6l the total time spent by BGMP 
is comparable to L2 and L2-L2. 

TABLE 6 

Total Time Spent on YaleB with Different Number of 
Training Images (in seconds) 



Pt 


0.55 


0.60 


0.65 


0.70 


0.75 


0.80 


L2 


2.48 


2.95 


3.02 


3.16 


3.56 


6.02 


L2-L2 


2.20 


2.51 


3.50 


3.94 


3.21 


6.06 


BGMP 


10.65 


6.11 


5.71 


4.93 


4.23 


2.85 
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7 Conclusions 

In this paper, a new convex relaxation scheme for sparse 
recovery problem is proposed by introducing a 0-1 support 
selection vector Based on the new convex formulation, a 
general matching pursuit (GMP) algorithm which can surely 
converge is naturally introduced. It shows that GMP can 
converge exponentially if (Tk+tg < 0.5. In addition, GMP 
with an ^i-regularizer can recover the fc-sparse signals if the 
restricted isometry constant cTfe < 0.307 — i^, where i' can 
be arbitrarily close to 0. Finally, SGMP with subspace search 
can further improve the performance of GMP . In general, GMP 
can be much faster than traditional ii methods. Furthermore, 
the computational cost can be further reduced by using batch- 
mode implementation. 

As a matching pursuit algorithm, GMP differs from OMP 
variants in several aspects. At first, GMP is motivated by 
solving a convex QCQP problem with the central cutting 
plane algorithm B36I . Therefore, it can surely converge and 
the global convergence is also guaranteed. On the contrary, 
except AIHT and OMP, most of them need restricted conditions 
for convergence. For AIHT, although it can monotonically 
decrease in objective value, experiments show that it is easy to 
get stuck into local minima ||3T| , 1321 . In addition, the standard 
OMP is a special case of GMP but with unbearable computation 
cost for large k. The computation issue also exist for other 
OMP type methods as they only include one element each 
iteration 1281 . l29l . Secondly, unlike most of the OMP variants, 
such as CoSaMP, SP and OMPR l24l, ||25l, ED, GMP dose 
not need to specify an exact estimation of the target sparsity. 
Thirdly, ROMP, SP, CoSaMP and OMPR need the replacement 
of atoms l30l . l25l . l23l . while GMP gradually increases g 
atoms without any atom replacement. Finally, SGMP that uses 
subspace search can further improve the performance of GMP, 
which shares the similar strength with SP. However, they 
are different since SP includes additional k atoms at each 
iteration B25I while SGMP only need to include 7]g{rig < k) 
atoms. 

Comprehensive numerical experiments demonstrate the ef- 
fectiveness and efficiency of the proposed methods. In addi- 
tion, by applying GMP to face recognition, the experimental 
results on two widely used datasets, namely Extended YaleB 
Database and AR database, show that the batch mode GMP can 
achieve significantly better recognition rates than two competi- 
tors, namely L2 and L2-L2, on a wide range of settings with 
comparable computational cost. Particularly, batch mode GMP 
can be up to 20 times faster than batch-mode OMP and 500 
times faster than the considered state-of-the-art £i method, 
namely PGH lT9l . 
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